Skip to content

ENH: Support MultiIndex columns in parquet (#34777) #36305

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 20 commits into from
Nov 19, 2020

Conversation

hweecat
Copy link
Contributor

@hweecat hweecat commented Sep 12, 2020

Support MultiIndex for columns in parquet format by updating value column names check to handle MultiIndexes.

Copy link
Member

@dsaxton dsaxton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add a test?

@dsaxton dsaxton added Enhancement IO Parquet parquet, feather labels Sep 12, 2020
1. Update check to handle MultiIndex columns for parquet format
2. Edit whatsnew entry.
3. Add test for writing MultiIndex columns with string column names
1. Include issue number as a comment on added test
if not all(
x.inferred_type in {"string", "empty"} for x in df.columns.levels
):
raise ValueError("parquet must have string column names")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can say something about 'for all values in each level of the MultiIndex'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jreback for the suggestion on the exception statement - adding that into my next commit!

mi_columns = pd.MultiIndex.from_tuples([("a", 1), ("a", 2), ("b", 1)])
df = pd.DataFrame(np.random.randn(4, 3), columns=mi_columns)
self.check_error_on_write(df, engine, ValueError)

def test_write_column_multiindex_string(self, pa):
# GH #34777
# Not supported in fastparquet as of 0.1.3 or older pyarrow version
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we > than the min pyarrow version?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the min versions listed in pandas dependencies, the min pyarrow version is 0.15 while we are currently at 0.16 - at least for the dev environment that I'm working on.

["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
["one", "two", "one", "two", "one", "two", "one", "two"],
]
df = pd.DataFrame(np.random.randn(8, 8), columns=arrays)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add names to the MultiIndex levels. do these round trip?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After adding names to the MultiIndex levels, looks like they do round trip on pytest.

@@ -297,6 +297,7 @@ I/O
- :meth:`to_csv` did not support zip compression for binary file object not having a filename (:issue: `35058`)
- :meth:`to_csv` and :meth:`read_csv` did not honor `compression` and `encoding` for path-like objects that are internally converted to file-like objects (:issue:`35677`, :issue:`26124`, and :issue:`32392`)
- :meth:`to_picke` and :meth:`read_pickle` did not support compression for file-objects (:issue:`26237`, :issue:`29054`, and :issue:`29570`)
- :meth:`to_parquet` did not support :class:`MultiIndex` for columns in parquet format (:issue:`34777`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would move this to other enhancements; say this is enabled with pyarrow=.....

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved the whatsnew entry to "other enhancements" - thanks!

1. Add tests for writing Indexes and MultiIndexes for columns
2. Edit message for check to handle  MultiIndex columns for parquet
3. Edit whatsnew entry to move entry to other enhancements
@pep8speaks
Copy link

pep8speaks commented Sep 14, 2020

Hello @hweecat! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-11-19 01:05:10 UTC

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks fine, can you add a whatsnew note; enhancements in 1.2

does this have a min pyarrow version?

@jreback jreback added this to the 1.2 milestone Oct 10, 2020
@hweecat
Copy link
Contributor Author

hweecat commented Oct 11, 2020

@jreback Since the enhancements work on pyarrow 0.15.0, we could leave the min pyarrow version as >= 0.15.0 for this pull request.

I've added a whatsnew note on enhancements in 1.2 under "Other enhancements" - feel free to let me know if a more detailed whatsnew note is needed. :)

P.S. Tests are failing after I did a rebase to update my branch.

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Nov 11, 2020
@hweecat hweecat force-pushed the io-parquet-multiindex branch from 05a2f62 to 039094c Compare November 11, 2020 03:43
@alimcmaster1
Copy link
Member

alimcmaster1 commented Nov 14, 2020

Test failures look unrelated, looks related to #37818

fsspectest = <pandas.conftest.fsspectest.<locals>.TestMemoryFS object at 0x7fb2b99e5e50>
extension = 'xls'

    @pytest.mark.parametrize("extension", ["xlsx", "xls"])
    def test_excel_options(fsspectest, extension):
        df = DataFrame({"a": [0]})
    
        path = f"testmem://test/test.{extension}"
    
>       df.to_excel(path, storage_options={"test": "write"}, index=False)

pandas/tests/io/test_fsspec.py:133: 

@charlesdong1991
Copy link
Member

/azp run

@azure-pipelines
Copy link
Contributor

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Member

@charlesdong1991 charlesdong1991 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI is back green, one minor comment on whatsnew

@jreback
Copy link
Contributor

jreback commented Nov 18, 2020

@hweecat small comments, can you merge master and ping on green.

@charlesdong1991 charlesdong1991 mentioned this pull request Nov 18, 2020
1 task
@jreback jreback merged commit 5f2bac5 into pandas-dev:master Nov 19, 2020
@jreback
Copy link
Contributor

jreback commented Nov 19, 2020

thanks @hweecat and @charlesdong1991

@charlesdong1991
Copy link
Member

all credits to @hweecat very nice job!

@mraxilus
Copy link

Any reason why this explicitly raises errors for non-string multi-indicies?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Support Multi-Index for columns in parquet format
7 participants